library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
library(modelr)
library(hexbin)
options(na.action = na.warn)
data(diamonds)
Your client, the world’s only ethical diamond miner has asked you to build a model predicting diamond prices based on the four C’s: Carats (weight), Cut, Color, Clarity. You recognize that diamonds are a hoax, but a gig’s a gig so you have accepted the contract.
#a
ggplot(diamonds, aes(cut, price)) + geom_boxplot()
#b
ggplot(diamonds, aes(clarity, price)) + geom_boxplot()
#c
ggplot(diamonds, aes(color, price)) + geom_boxplot()
#d
From the plot, we can see that generally lower quality diamonds often have higher prices which is very suprising.
#a
dia <- ggplot(diamonds, aes(carat, price))
#b
dia + geom_hex()
dia + geom_hex(bins = 10)
#d
dia + geom_hex(bins = 20)
dia + geom_hex(bins = 30)
dia + geom_hex(bins = 40)
dia + geom_hex(bins = 50)
#c
The "bins" is the numeric vector giving number of bins in both vertical and horizontal directions.
"bins= 10" gives 10 bins in both vertical and horizontal directions.
In addition, it also contains many outliers in the plot.
dia_red <- filter(diamonds, carat <= 2)
dia_red
## # A tibble: 52,051 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4.00 4.05 2.39
## # ... with 52,041 more rows
#a
dia_2 <- diamonds %>% filter(carat <= 2) %>% mutate(log_pr = log2(price), log_car = log2(carat))
dia_2
## # A tibble: 52,051 x 12
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4.00 4.05 2.39
## # ... with 52,041 more rows, and 2 more variables: log_pr <dbl>,
## # log_car <dbl>
#b
ggplot(dia_2, aes(log_car, log_pr)) + geom_hex(bins = 50)
lm_dia2 <- lm(log_pr~log_car, data = dia_2)
lm_dia2
##
## Call:
## lm(formula = log_pr ~ log_car, data = dia_2)
##
## Coefficients:
## (Intercept) log_car
## 12.209 1.697
grid <- dia_2 %>%
data_grid(carat = seq_range(carat, 20)) %>%
mutate(log_car = log2(carat)) %>%
add_predictions(lm_dia2, "log_pr") %>%
mutate(price = 2 ^ log_pr)
ggplot(dia_2, aes(carat, price)) +
geom_hex(bins = 50) +
geom_line(data = grid, colour = "red", size = 1)
Since the residuals were minimized for the logged data, we should plot those residuals to see if a linear model was appropriate for the log-transformed data. Make a plot or two of the ‘log_resids’ and comment on it.dia_2 <- dia_2 %>%
add_residuals(lm_dia2, "log_resid")
ggplot(dia_2, aes(log_car, log_resid)) +
geom_hex(bins = 50)
plot(lm_dia2,which=1:2)
#a
ggplot(dia_2, aes(cut, log_resid)) + geom_boxplot()
ggplot(dia_2, aes(color, log_resid)) + geom_boxplot()
ggplot(dia_2, aes(clarity, log_resid)) + geom_boxplot()
The associations are as expected. As the quality of the diamond increases, it tends to become more relative to price.
#b
lm_dia2 <- lm(log_pr ~ log_car + color + cut + clarity, data = dia_2)
#c
summary(lm_dia2)
##
## Call:
## lm(formula = log_pr ~ log_car + color + cut + clarity, data = dia_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.10924 -0.12391 -0.00224 0.11703 2.74693
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.221e+01 1.765e-03 6922.181 < 2e-16 ***
## log_car 1.889e+00 1.170e-03 1614.839 < 2e-16 ***
## color.L -6.353e-01 2.991e-03 -212.384 < 2e-16 ***
## color.Q -1.362e-01 2.774e-03 -49.103 < 2e-16 ***
## color.C -1.994e-02 2.586e-03 -7.712 1.26e-14 ***
## color^4 1.998e-02 2.364e-03 8.454 < 2e-16 ***
## color^5 6.179e-05 2.215e-03 0.028 0.9777
## color^6 4.488e-03 1.993e-03 2.252 0.0243 *
## cut.L 1.727e-01 3.459e-03 49.923 < 2e-16 ***
## cut.Q -4.770e-02 3.040e-03 -15.692 < 2e-16 ***
## cut.C 1.571e-02 2.623e-03 5.990 2.11e-09 ***
## cut^4 -3.070e-03 2.092e-03 -1.467 0.1423
## clarity.L 1.293e+00 5.290e-03 244.403 < 2e-16 ***
## clarity.Q -3.148e-01 4.965e-03 -63.396 < 2e-16 ***
## clarity.C 1.618e-01 4.238e-03 38.168 < 2e-16 ***
## clarity^4 -7.572e-02 3.363e-03 -22.515 < 2e-16 ***
## clarity^5 2.873e-02 2.710e-03 10.600 < 2e-16 ***
## clarity^6 6.082e-04 2.336e-03 0.260 0.7946
## clarity^7 4.791e-02 2.055e-03 23.314 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1905 on 52032 degrees of freedom
## Multiple R-squared: 0.9815, Adjusted R-squared: 0.9815
## F-statistic: 1.535e+05 on 18 and 52032 DF, p-value: < 2.2e-16
dia_2 <- dia_2 %>%
add_residuals(lm_dia2, "log_resid2")
ggplot(dia_2, aes(log_car, log_resid2)) +
geom_hex(bins = 50)
a. Find the outliers that represent great buying or selling opportunities. (You don’t have to find the data points, just identify the right residuals.)
There are some diamonds with quite large residuals
dia_2 %>%
filter(dia_2$log_resid2 < 0) %>%
add_predictions(lm_dia2) %>%
mutate(pred = (2 ^ pred)) %>%
select(price, pred) %>%
mutate(difference= pred - price) %>%
summarise(sum(difference))
## # A tibble: 1 x 1
## `sum(difference)`
## <dbl>
## 1 7579811
Hopefully you can see why identifying outliers can be interesting for those motivated by money.
Opportunity <- dia_2 %>%
filter(abs(log_resid2) > 1) %>%
add_predictions(lm_dia2) %>%
mutate(pred = round(2 ^ pred)) %>%
select(price, pred, carat:table, x:z) %>%
arrange(price)
Explain each line of this model and try to find a reason for mispricing.
# log(price/predictPrice) >1; find diamonds with indicated price higher than average price
# get prediction
# select and then rank from lowest to highest
Exercises
In the plot of log_pr vs log_car, there are some bright vertical strips. What is causing those?
The bright vertical strips mean that there are some variations on `log_pr` although there seems to be a linear relationship between `log_pr` and `log_car`.If, what does that say about the relationship between price and carat? (Use any resource. If anyone is a chemistry major, they may know off-hand).
They have a positive correlation. If carat is increased by 1%, it will cause an b1% increase in price.Does the final, full model do a good job of predicting diamond prices? Should your client trust it when buying diamonds or setting diamond prices? Give some justification for your answer.
summary(lm_dia2)
##
## Call:
## lm(formula = log_pr ~ log_car + color + cut + clarity, data = dia_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.10924 -0.12391 -0.00224 0.11703 2.74693
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.221e+01 1.765e-03 6922.181 < 2e-16 ***
## log_car 1.889e+00 1.170e-03 1614.839 < 2e-16 ***
## color.L -6.353e-01 2.991e-03 -212.384 < 2e-16 ***
## color.Q -1.362e-01 2.774e-03 -49.103 < 2e-16 ***
## color.C -1.994e-02 2.586e-03 -7.712 1.26e-14 ***
## color^4 1.998e-02 2.364e-03 8.454 < 2e-16 ***
## color^5 6.179e-05 2.215e-03 0.028 0.9777
## color^6 4.488e-03 1.993e-03 2.252 0.0243 *
## cut.L 1.727e-01 3.459e-03 49.923 < 2e-16 ***
## cut.Q -4.770e-02 3.040e-03 -15.692 < 2e-16 ***
## cut.C 1.571e-02 2.623e-03 5.990 2.11e-09 ***
## cut^4 -3.070e-03 2.092e-03 -1.467 0.1423
## clarity.L 1.293e+00 5.290e-03 244.403 < 2e-16 ***
## clarity.Q -3.148e-01 4.965e-03 -63.396 < 2e-16 ***
## clarity.C 1.618e-01 4.238e-03 38.168 < 2e-16 ***
## clarity^4 -7.572e-02 3.363e-03 -22.515 < 2e-16 ***
## clarity^5 2.873e-02 2.710e-03 10.600 < 2e-16 ***
## clarity^6 6.082e-04 2.336e-03 0.260 0.7946
## clarity^7 4.791e-02 2.055e-03 23.314 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1905 on 52032 degrees of freedom
## Multiple R-squared: 0.9815, Adjusted R-squared: 0.9815
## F-statistic: 1.535e+05 on 18 and 52032 DF, p-value: < 2.2e-16
plot(lm_dia2)
dia_2 %>%
add_predictions(lm_dia2) %>%
add_residuals(lm_dia2) %>%
summarise(sq_err = sqrt(mean(resid^2)),
abs_err = mean(abs(resid)),
p975_err = quantile(resid, 0.975),
p025_err = quantile(resid, 0.025))
## # A tibble: 1 x 4
## sq_err abs_err p975_err p025_err
## <dbl> <dbl> <dbl> <dbl>
## 1 0.1904401 0.1480822 0.3855161 -0.3617156
From the plot, I think clients should trust it when buying diamonds because the full model is very good.